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Abstract 

Inverse problems lend themselves naturally to a Bayesian formulation, in which 
the quantity of interest is a posterior distribution of state and/or parameters 
given some uncertain observations. For the common case in which the forward 
operator is smoothing, then the inverse problem is ill-posed. Well-posedness 
is imposed via regularisation in the form of a prior, which is often Gaussian. 
Under quite general conditions, it can be shown that the posterior is absolutely 
continuous with respect to the prior and it may be well-defined on function 
space in terms of its density with respect to the prior. In this case, by con- 
structing a proposal for which the prior is invariant, one can define Metropolis- 
Hastings schemes for MCMC which are well-defined on function-space ([H, 
and hence do not degenerate as the dimension of the underlying quantity of 
interest increase to infinity, e.g. under mesh-refinement when approximating 
PDE in finite-dimensions. However, in practice, despite the attractive theo- 
retical properties of the currently available schemes, they may still suffer from 
long correlation times, particularly if the data is very informative about some 
of the unknown parameters. In fact, in this case it may be the directions of 
the posterior which coincide with the (already known) prior which decorrelate 
the slowest. The information incorporated into the posterior through the data 
is often contained within some finite-dimensional subspace, in an appropriate 
basis, perhaps even one defined by eigenfunctions of the prior. We aim to ex- 
ploit this fact and improve the mixing time of function-space MCMC by careful 
rescaling of the proposal. To this end, we introduce two new basic methods 
of increasing complexity, involving (i) characteristic function truncation of high 
frequencies and (ii) hessian information to interpolate between low and high fre- 
quencies. The second, more sophisticated version, bears some similarities with 
recent methods which exploit local curvature information, for example RMHMC 
0, and Stochastic Newton [4]. These ideas are illustrated with numerical ex- 
periments on the Bayesian inversion of the heat equation and Navier-Stokes 
equation, given noisy observations. 
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1. Introduction 



Sampling distributions over high-dimensional state-spaces is a notoriously 
difBcult problem, and a lot of effort has been devoted to developing methods to 
do this effectively 0, S 0, 0. If a given target distribution is known pointwise, 
then Markov chain Monte Carlo is a general method which prescribes a Markov 
chain having this target distribution as its invariant distribution. Examples 
include Metropolis-Hastings schemes, developed first by 8] and later generalized 
for arbitrary proposals by [9[ . It is well known that Metropolis- Hastings schemes 
developed for finite-dimensional systems degenerate in the limit of inference on 
infinite-dimensional spaces and a lot of work has been devoted to examining 
the actual rate with which they degenerate . Indeed this analysis has lead to 
improved performance of chains in finite-dimensions through arguments about 
optimal scaling of the acceptance probability. Recently, it has been shown in 
a series of works by 0, H [H that it is possible to overcome this limitation 
by defining the Metropolis-Hastings scheme in function space, thus confronting 
the issue of high-dimensions head on. This is particularly useful, for example, 
in the case that the system is a finite-dimensional approximation of a PDE. 
In such a case, the convergence properties of the function space scheme are 
independent of mesh refinement which increases the dimension of the finite- 
dimensional approximating system. However, it may (and often does in practice) 
happen that, even though the scheme is defined on function-space and is hence 
independent of mesh refinement, the integrated autocorrelation may be quite 
large for certain degrees of freedom, in particular those which have small effect 
on the likelihood. Hence, there is clearly a paradox in that the degrees of freedom 
which are not informed by the data, and hence are governed by the prior, are 
the rate limiting factor in the convergence of MCMC. But, at the same time, 
more is known a priori about such degrees of freedom. We aim to maintain 
the positive aspect of being defined in function space while removing the deficit 
of mixing poorly in directions which are better known a priori. This will be 
accomplished by appropriate rescaling based on curvature which leverages a 
priori known information. Inspired by the works 0, 0, [H we will focus on 
sampling distributions ^ over some Hilbert space X which have density with 
respect to a Gaussian reference measure of the following form 



where $ : X — > M, and Z = 1/ J-^ cxp[— <l'(M)]/io((iu). For simplicity, we let 



and we will further assume that G{u) is a compact operator. This function arises 
as the negative log likelihood of m /io given a realization of an observation 
y = G{u) -\-rj, where 77 ^ A/'(0, 7^/) independent of u. The posterior distribution 
on u, conditioned on the particular observation y, is given by (jl.ip . 



^,{du) — Zexp[— $(u)]/io((iu). 



(1.1) 



Hn)^y^\G{u)-y\\ 



(1.2) 
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The rest of this paper will be organized as follows. In Section [5J we will pro- 
vide a review of necessary concepts, including the recently developed function- 
space MCMC method pCN 0] which will serve as a starting point for this work. 
In Section |3] we illustrate the remaining decorrelation issues pertaining to pro- 
posal chains of the pCN form. In Section 2] we introduce a class of constant 
operator rescaled proposals which address the issues highlighted in Section [3] 
and yet still fit into the function space sampler framework, hence taking full 
advantage of its benefits. In Section [SJ we investigate the performance of these 
algorithms computationally against standard pCN for the inverse heat equation 
and inversion of Navier-Stokes equation. Finally, we provide conclusions and 
possible future directions in Section [51 

2. Background 

In this section we give some background on Markov chain Monte Carlo 
(MCMC) methods, and in particular the Metropolis-Hastings (MH) variants. 
We then introduce the preconditioned Crank-Nicholson (pCN) proposal that 
yields an MH algorithm which is well-defined on function spaces for a wide 
range of posterior distributions of the form (jl.ip , for example those arising from 
the Bayesian interpretation of solution to an inverse problem with Gaussian 
prior and observational noise. Finally, we will describe the way that correla- 
tions enter into approximations based on the resulting set of samples and define 
the effective sample size, which will be relevant in the derivation of the new 
methods which follow. 

2.1. MCMC 

MCMC methods aim to sample from a target distribution fi over H by 
designing a Markov transition kernel V such that fi is invariant under the action 
of -P 



with shorthand {fiV = /i), where the integral on the lefthand side is with respect 
to du. The condition known as detailed balance between a transition kernel V 
and a probability distribution fi says that 



Integrating with respect to m, one can see that detailed balance implies f/P = jj,. 
Metropolis-Hastings methods prescribe an accept-reject move based on propos- 
als from an arbitrary transition kernel Q in order to define a kernel V such that 
detailed balance with an arbitrary probability distribution is guaranteed. In 
order to make sense of MH in infinite dimensions, first define the measures [lo| 




(2.1) 



^{du)P{u, dv) — ^{dv)V{v, du), \f u,v d H. 



(2.2) 



vldu, dv) 
v^{du, dv) 



Q{u, dv)^(du) 
Q{v,du)iJ,{dv). 



(2.3) 
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Provided v-^ ^ i/, one can define the MH kernel as follows. Given current state 
Un, a proposal is drawn u* ^ Q(un, ■), and then accepted with probability 

a{un,u*) = min |l, ^^(u„,u*)| . (2.4) 

The resulting chain is denoted by P and one can see that it satisfies the necessary 
detailed balance (|2.2I) with respect to /i. Properly speaking 



P{u,dv) = a{u,v)Q{u,dv) + du{dv) {1 - a{u,w))Q{u,dw) (2.5) 



X 

Under an additional assumption of geometric ergodicity, one has that \V^{uo, •) — 
^1 — > for any uq G H Therefore, after a sufficiently long equilibration 

period, or burn-in, one has a prescription for generating samples from the dis- 
tribution ^, from which integrals with respect to can be approximated. These 
samples are correlated, however, and this will be revisited below. 

Since there is great freedom in choice of proposal kernel Q, one can imagine 
that there is great discrepancy in the performance of the algorithms resulting 
from different choices. It is popular to choose symmetric kernels, such as the 
classical random walk algorithm, in which the difference between the current 
and proposed states is taken to be Gaussian distributed. For this proposal, the 
absolute continuity condition ^ is violated, so the MH algorithm is defined 
only in finite-dimensions [l[ and degenerates under mesh-refinement. However, 
a small modification to the classical random walk proposal: 



Q(u„,-) -A/'(v/l-/3'^n,/?'C') (2.6) 

yields an algorithm which satisfies ly^ i' and is defined on function space. 
This algorithm is referred to as pCN in Q. The condition that z/-*- <C for 
algorithms on function space ensures that the methods are well-behaved with 
respect to mesh-refinement; only such methods are considered in this paper (see 
01 for more details on that point). 

2.2. Correlations 

Let i? : 7J — R, and suppose we would like to estimate the integral E(i?) = 
R{u)^{du) by N samples i?„ — f{un), where w„ fi. Denoting this estimate 
by ^, we have 

1 ^ 

n=l 

Then E(^) = ]E(i?), so the estimator is without bias. Furthermore, we denote 
the variance of R by 

Y{R) = E[{R - E{R)){R - E{R))]. 

So, we also have that 

V(i?) = E[{R - E{R)){R - E{R))]. 
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Wc restrict attention to the case E(i?) = 0, for ease of exposition only. Assume 
also that the samples arc correlated in such a way that 

pn = E{RmRm+n)/^{RmRm) Vm, U. 

For example, such samples arise from a sample path of a homogeneous Markov 
chain, such as in MCMC. Then 

V(^) = E[RR] 

= (1/^^) (Er^=l RnRn + J2n,m=l,nyim RnRmj ,^ 
« (l/iV)V(i?)(l + 2Eti/Cn) 

where we assumed p„ for all n > N* with N* 4C N, and we denoted 
6 = J2n=i Pn- If the Rn are independent then ^ = and we recover constant 
prefactor 1 to Y{R)/N. Otherwise, the constant prefactor reduces the effec- 
tive number of samples with respect to the case of independent samples. It is 
therefore informative to define the effective sample size by iVe// = ^/(l + 2^). 
Phrased another way, one needs 1 + 29 samples in order to get another inde- 
pendent sample. Hence, p„ and are referred to as the autocorrelation function 
and the integrated autocorrelation, respectively. The general case is dealt with 
similarly, working elementwise. In this case, the effective sample size is given 
by the minimum over effective sample sizes of individual degrees of freedom, 
strictly speaking. 

From the above discussion, it is clear that the performance of an MH algo- 
rithm in approximating integrals with respect to a distribution is limited by the 
correlation between the samples it generates. The rest of the paper will concern 
this point about correlations. 

3. Proposal Chains for MCMC 

In this section, we explore some drawbacks of the function-space sampling 
algorithm defined in the previous section. We will first motivate the need for 
modification for a simple 1 dimensional chain. Then, we describe how this same 
idea lifts into higher dimensions. 

3.1. Principle in Id 

As an example, consider the following Markov Chain 

where Xq, Wn ~ ^^(0, a^) i.i.d. for all n and /3 < 1. So, EX„ = for all n, and 

n-l 

Xn = {l- P^r^^Xo + ^ ^(1 - 

fe=0 
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decorrelation with (Z) and without (X) accept/reject decorrelation with (Z) and without (X) accept/reject 




x10^ 



Figure 1; Illustration of slow decorrelation in proposal chain X, and fast decorrelation in 
the informed chain Z. Left plot shows autocorrelations, as compared also to the analytical 
autocorrelation of X. Right plot shows a segment of the evolution over which the {Xn} have 
barely decorrelated, but the {Zn} have clearly decorrelated within a small segment. 

Therefore, p„ = E(XoX„)/E(^o^o) = (1 - /3^)"/^ ~ expi-nP"^ /2), and 6 = 
'^n=i ~ 2//?^. Notice also that this is independent of a. 

Now, consider the Metropohs- Hastings chain Z with proposal given by X, 

i.e. 

Z* = {l-p^y^^Zn + l3Wn 

Zn+i - a5{Z* - •) + (!- a)6{Z,, - •), a = lA exp(0(Z„) - cj){Z*)). 

We run an experiment. Let (j>{z) = {l/2j'^){z — 1)^, with 7 = 0.01 and a = 1. 
So, the above chain samples the posterior distribution with observation y = 
z + A/'(0,7^) — 1 and prior Z ^ A/'(0, 1). Figure [T] illustrates the decorrelation 
time and evolution of the proposal chain, X, and the informed chain Z for a 
given choice of /3 which yields reasonable acceptance probability. The speedup 
is astonishing, and indeed counterintuitive: one only accepts or rejects moves 
from the proposal chain, so intuition may suggest that the informed chain would 
have larger autocorrelation. But actually, if the distribution is dominated by 
the likelihood, then the integrated autocorrelation is approximately the inverse 
of the expected acceptance, which is approximately 1 by construction. On the 
other hand, if the distribution is dominated by the prior, then the integrated 
autocorrelation is dictated by the proposal chain. If the state-space were two 
dimensional, with an observation only in the first dimension, then for a given 
choice of scalar j3 the chain will mix like Zn in the observed component and 
like Xn in the unobserved component. The same idea extrapolates to higher 
dimensions, motivating the use of direction-dependent proposal step-sizes. 

3.2. Higher dimensions 

Recall in Id that if the distribution is determined by the likelihood then the 
integrated autocorrelation is approximately the inverse of the expected accep- 
tance, and if the distribution is determined by the prior then the integrated 
autocorrelation is dictated by the proposal chain. Suppose that some of the 
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space is dominated by the prior, and some is dominated by the hkehhood. If 
the observations have small variance, then the step-size j3 in the chain will have 
to be small in order to accommodate a reasonable acceptance probability. But, 
the part of the space which is dominated by the prior will have 0{1/ 0^) auto- 
correlation, as dictated by the proposal chain. 

Now consider the Bayesian inverse problem on a Hilbert space H . Let 
denote an orthonormal basis for the space H. Let P project onto 
a finite basis {fj^'j^i and Q = I — P. Suppose we wish to sample from the pos- 
terior distribution /i with prior distribution /j,Oj where /io(Q-)) ^ ^ for 
some e <C 1 and some distance function •) between probability measures (for 
example Hellinger metric or Kullback-Leibler distance). Thus P projects onto 
the "important part" of the space, in the sense that the observations only inform 
the projection of the posterior distribution on this part of the space. Suppose 
we use Metropolis-Hastings to sample from posterior /i with prior /io = A/'(0, C). 
Let uo, Wn ~ A/'(0, C), so that the chain 

Un+l ^ {1 - P^y/^Un + l3Wn (3.1) 

leaves the prior invariant. Without loss of generality, we represent this chain 
in the basis in which C is diagonal, so the proposal for each degree of freedom 
is an independent version of the above and clearly has decorrelation time 1 -I- 
26 Ki 4//3^. We impose accept/reject using the above chain as proposal, so 
we are sampling from fi{Q-) with decorrelation time (1 -I- 26) /Ka where Ea is 
the expected acceptance probability (the acceptance rate imposes additional 
correlation on top of the proposal chain). But, the distribution /i « /ip on QH, 
so we may as well sample independently from this known distribution in this 
part of the space. 

One can consider modifying the above proposal chain to sample from the 
pCN chain (jS.ip on PH and sample independently from the prior on QH. This 
corresponds to setting the chain above to 

Un+l = (1 - Py^^PUr, + {PP + Q)W^. (3.2) 

Our first and simplest variant of weighted proposal follows directly from this. 
In the next section we will consider a more elaborate extension of this proposal 
using curvature information. 

4. Operator v^reighted proposal 

Inspired by the discussion in the preceeding section, we let i?„ be an operator 
weighting different directions differently, and define the proposal chain to be 

Un+l = ^/KlUn + ^Jl - BnWn- (4.1) 

Notice that uq, Wq ~ A/'(0, /) u„ - 7V(0, /) Vn. 

Recall the form of distribution we consider, given by (|l.ip and (|1.2p . We will 
consider prior densities of the form u ^ A/'(0, C). If the prior covariance C I, 
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then a change of variables u — > C^^'^u is imposed to whiten the prior, such that 
a chain of the form (|4.1I) keeps the prior invariant. The samples of the resulting 
MH chain are then transformed back as it ^ C^^^^u for inference. 

For the moment, we revert to the case in which H = M.^ in order to motivate 
the form of the proposals. In this case we have /i ^ A, where A is Lebesgue 
measure, and after transformation 

= ^\y~GiC'/'u)\' + l\u\' + Kiy). 

At a given point w the local curvature is given by the Hessian of this functional, 
which can be approximated by the positive definite operator 

^C'/^Dg{C'/^wrDgiC'/^w)C'/^ + 1. (4.2) 

If {Afc,(/>fc} are eigenpairs associated to this operator, then: 

• if A/c 3> 1, then the distribution is dominated by the data likelihood in the 
direction 0^; 

• if Afe ~ 1, then the distribution is dominated by the prior in the direction 

The direction of the largest curvature at a given point will determine the smallest 
scale of the probability distribution at that point. If Ai is the largest eigenvalue 
of (|4.2I) . then the smallest local scale of the density will be 1/Ai. This scale 
will directly determine the step-size necessary to obtain a reasonable acceptance 
probability within the MH algorithm. But, such a small step-size is only neces- 
sary in the direction </>!. So, rather than prescribing this as the stepsize in the 
remaining directions, we would like to rescale them according to the operator 
given in (|4.2p . 

The log-density is not well-defined in function space, but the above operator 
(|4.2p is, and it will be the basis for choosing Bn in (|4.1[) . Based on the above 
intuition, one can devise a whole heirarchy of effective such operators. Indeed, 
one can allow for Bn constant, either during the duration of the algorithm, after 
a transient adaptive phase, or even for some lag-time where it is only updated 
every so often. In this manuscript we consider only constant B, and we consider 
three simple levels of complexity. The rigorous theory underlying the algorithms 
will be deferred to future work, although we note that the case of constant B is 
easier to handle theoretically, since one does not need to consider the conditions 
for ergodicity of adaptive MCMC (l^ . 

The first level of complexity consists of letting 

B={l-P^)I. (4.3) 

The resulting method corresponds to the original pCN proposal, and we denote 
this method by O. 



-loe 



(u) 
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The second level of complexity consists of letting 



Bk,j — (1 — P'^)'i-k<k^Sk^ 



(4.4) 



where B = {Sfcj }fc,j"=i°= , Skj is the Kronecker delta function, and fee G (0, /cmax] 
and /? £ (0, 1] are chosen during a transient adaptive phase, either together or 
one at a time, yielding p.2p . The simple idea is that the curvature is constant 
and prescribed by the prior for sufBciently high-frequencies, and this has already 
been motivated by the discussion in the previous section. The method resulting 
from this proposal will be denoted by C. 

The third level of complexity consists of utilizing a relaxed version of the 



where /? is the scalar tuning parameter as in the above cases, C £ (0, 1] is another 
tuning parameter, and w is any point. The method resulting from this proposal 
will be denoted by H. We consider a low-rank approximation of the resulting 
operator, since it is assumed to be compact. Notice that for this choice of B, 
the covariance of the search direction is given by 



For linear G we let C = 1, and the right-hand side of the above approaches the 
covariance of the posterior distribution as /3 — 0, giving exactly the correct 
rescaling for the search direction while still keeping the prior invariant. The 
linear case is rather special, since we could just do independence sampling of 
the posterior, but it provides a good benchmark. In practice, adaptation of /? 
in this case leads to small, but non-zero, (3. For nonlinear G we choose C < 1: 
and the scaling of the search direction is commensurate with a prior-inflated 
version of the local curvature, given by the first term of (I4.6p . The second 
term is a perturbation which is larger for the (changing) directions dictated by 
the Hessian of the log-likelihood and approaches zero for directions which are 
dictated by the prior, in which the curvature is constant. Provided C is chosen 
small enough, and the curvature does not change too much, the step-size of a 



^In the case of a multi-modal distribution, this would have to be updated at least each time 
the chain passes to a new peak of the distribution. However, in the case of high-dimensional 
sampling, it is rare that one might expect to sample more than a single peak within a single 
chain, since it is expensive to compute individual samples and many such samples are usually 
necessary to reach the first passage time. Therefore, we separate the issues of finding a peak 
of the distribution, where gradient information is invaluable, with exploring the peak, where 
gradient information is expected to be less vaulable. On the other hand, curvature information 
may be useful in the former, and is surely invaluable in the latter. 




I -B 



j3''{C^/^DG{C^/^wy DG{C^/^w)C^/^ + (j^I)-^ 
C^I^DGiC^l^w)* DG{C^l^w)G^I^ . 



(4.6) 
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given move will not exceed the scale of the distribution. Furthermore, the step- 
size will interpolate appropriately between the direction of the smallest scale 
and the directions which are determined by the prior, corresponding to scale 
1. We note that a whole array of different proposals are possible by choosing 
different Bn, and surely there are better options since an exhaustive search has 
not been carried out. 

5. Numerical Results 

In this section, we will investigate the performance of the MH methods 
introduced in the previous section on two prototypical examples. First, we 
briefly explore the inverse heat equation on the circle as an illustrative example. 
Next, we look in more depth at the more complicated problem of the inverse 
Navier-Stokes equation on the 2D torus. 

5.1. Heat Equation 

Consider the one dimensional heat equation 

dv d^v 

v{t,0) = v{t,TT) = 0. 

We are interested in the inverse problem of recovering u = v{0,x) from noisy 
observations oi v{l,x): 

y = v{l,x)+r] 

= Gu + T], 

where G = exp{-A}, A = (-^), D{A) ^ H^{n)nH^{n) with n = (0,7r), 
and r] ~ iV(0,72/). We let C = 10'^ A-^, and 7 = 1. 

The problem may be solved explicitly in the Fourier sine basis, and the 
coefficients of the mean and variance Vk of the posterior distribution can 
be represented in terms of the coefhcients of the data yk and the prior Ckk = 
104fc-2: 

ruk = {e-^^' +lQ'k^r\-''\k. 

Vk = (e-2feVio4fc2)-i_ (51) 

The posterior distribution is strongly concentrated on the u\ mode, in the sense 
that mi ;2> vfij and V\ ^ for all j > 1, and the discrepancy between F and 
C will necessitate a very small scalar /3. For the case of scalar weighting O, 
all modes will decorrelate slowly except for ui . See Fig. [2] for a comparison 
between scalar weighting O and curvature-based weighting H. In this case, we 
will omit the intermediate proposal which simply proposes independent samples 
outside a certain radius in the spectral domain. 
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mode dependence of decorrelation, O mode dependence of decorrelation, H 
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Figure 2: Illustration of the autocorrelation of modes iii and ugo, in the case of O (left), and 
H (right). 



5.2. Navier-Stokes Equation 

In this section, we consider the inverse problem of determining the initial con- 
dition of Navier-Stokes equation on the 2 dimensional torus given noisy observa- 
tions. This problem is relevant to data assimilation applications in oceanography 
and meteorology. 

Consider the 2D Navier-Stokes equation on the torus := [—1, 1) x [—1, 1) 
with periodic boundary conditions: 

dtv - lyAv + v\/v + Vp = / for all (x, t) e x (0, oo), 
V-v =0 for all (a;,t) e T2 X (0,oo), 

V ^ u for all e T2 X {0}. 

Here w: x (0, oo) — > is a time-dependent vector field representing the 
velocity, p: x (0, oo) — !■ M is a time-dependent scalar field representing the 
pressure, / : M'^ is a vector field representing the forcing (which we assume 

to be time-independent for simplicity) , and i' is the viscosity. We are interested 
in the inverse problem of determining the initial velocity field u from pointwise 
measurements of the velocity field at later times. This is a model for the situ- 
ation in weather forecasting where observations of the atmosphere are used to 
improve the initial condition used for forecasting. For simplicity we assume that 
the initial velocity field is divergence-free and integrates to zero over , noting 
that this property will be preserved in time. 
Define 

v. := < trigonometric polynomials w : ^ V • ti = 0, / u{x) dx = 

and H as the closure of TL with respect to the (i^(T^))^ norm. We define 
P: (L^(T^))^ — >• to be the Leray-Helmholtz orthogonal projector (see [13|). 



Given k = (fci, ^2)""", define fc-*" := (fc2, —ki)'^. Then an orthonormal basis for H 
is given by : — >• R^, where 

k 



ipk{x) exp(nik ■ x^ 
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for k {0}. Thus for u £ we may write 

fcez2\{o} 

where, since w is a real- valued function, we have the reality constraint u_fc = 
—Uk- Using the Fourier decomposition of u, we define the fractional Sobolev 
spaces 

H':={u^h\ {Ak?yW?<oo^ (5.2) 

with the norm ||u||, where s e R. If A = -PA, the 

Stokes' operator, then = D{A'^/'^). We assume that / e H'^ for some s > 0. 

Let tn = nh^ for n = 0, . . . , A^, and let w„ € M*^ be the set of pointwise values 
of the velocity field given by {v(a;„i, tn)}m£i- Note that each w„ depends on u 
and define Gn - H ^ R^^ by Gn{u) — Vn- We let {?7n}^=i be a set of random 
variables in R^^ which perturbs the points {vn}n=i to generate the observations 
{yn}n=i in I**^ given by 

Vn-^Vn+Tln, n G { 1 , . . . , iV} . (5.3) 

We let y = {Vn}n=i denote the accumulated data up to time T = Nh, with sim- 
ilar notation for rj, and define G: H ^ R^'^ by G(m) = (Gi(u), . . . , Gjv(m)). 
We now solve the inverse problem of finding u from y — G{u) + rj. We as- 
sume that the prior distribution on m is a Gaussian /ig = N(0,Go), with the 
property that /io(-ff) = 1 and that the observational noise {r]n}n=i is i.i-d. in 
R*^, independent of u, with r/i distributed according to a Gaussian measure 
iV(0,7'/). 

We let Gq ~ Ti^A^"^ noting that if u ^ /xq, then u G iJ* almost surely for 
all s < 1; in particular u £ H. Thus fJ-o{H) = 1 as required. The forcing 
in / is taken to be / = V^^E*, where ^E* = (2/7r)(cos(7rfcc ■ x) + sin(7rfcs • x) 
and V"*" = JV with J the canonical skew-symmetric matrix, kc — (5, 5) and 
kg — (—5, 5). Furthermore, we let the viscosity v = 0.1, and the interval between 
observations h = St = 0.05, where St is the timestep of the numerical scheme, 
and = 10 so that the total time interval is T = Nh = 0.5. In this regime, the 
nonlinearity is mild and the attractor is a fixed point, similar to the linear case. 
Nonetheless, the dynamics are sufficiently nonlinear that the optimal proposal 
used in the previous section does not work. We take M — 32^ evaluated at 
the gridpoints, such that the observations include all numerically resolved, and 
hence observable, wavenumbers in the system. The observational noise standard 
deviation is taken to be 7 = 3.2. 

The truth and the mean vorticity initial conditions and solutions at the 
end of the observation window are given in Fig. [31 The truth (top panels) 
clearly resembles a draw from a distribution with mean given by the bottom 
panels. Fig. 2] shows the diagonal of the rescaling operator B (i.e. for (3 — 0) 
for C, given by (4.4) and H, given by (4.5). For O the operator is given by 
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the identity, i.e. all one along the diagonal. Some autocorrelation functions 
(approximated with one chain of 10^ each) are shown in Fig. [5j Also given 
for comparison is the exponential fit to the (acceptance lagged) autocorrelation 
function for the proposal chain for O, i.e. exp{—nf3^Ea/2). This differs for 
each, because p and Ea differ. In particular, for /3 = 0.024 and Ea « 0.41 
(approximated by the sample mean), for C /3 = 0.015 and Ea ?s 0.32, and for H 
P = 0.015 and Ea sa 0.25. Presumably this accounts for the slight discrepancy 
in the autocorrelation of mode given in the top left panel. For the mode 
wi,2 (top right), the difference becomes more pronounced between H and the 
other two. In the bottom left it is clear that the mode ui^ decorrelates very 
quickly for H, while it decorrelates approximately like the proposal chain for 
both other methods. The mode (bottom right) is independently sampled 
for C, so has decorrelation time 1/Ea. It decorrelates almost as fast for H, and 
is decorrelating with the proposal chain for O. 

Figure [5] shows the relative error in the mean (left) and variance (right) 
with respect to the converged values as a function of sample number for small 
sample sizes. The accumulated benefit of H is very apparent here, yielding 
almost an order of magnitude improvement. For each of the proposals, we run 
16 chains of length 10^, both for the benchmark mean and variance used in 
Fig. [SJ and to compare the so-called potential scale reduction factor (PSRF) 
convergence diagnostic [1J|. In Fig. [7] we plot the PSRF for each method on 
the same color scale from [1, 1.005]. Although it cannot be seen for O(left) and 
C (middle) since the values of most modes are saturated on this scale, every mode 
for every method is clearly converged according to the PSRF, which should be 
smaller than 1.1 and eventually converges to 1. We can see for O that the 
low frequencies, where the mean and uncertainty are concentrated, converge 
faster (as indicated in [5]), and other modes converge more or less uniformly 
slowly. C clearly outperforms O in the high frequencies outside the truncation 
radius (where the mean and uncertainty are negligible) and is comparable for 
the smallest frequencies, but may even be worse than O for the intermediate 
frequencies. This may be explained by the different /? and Ea. Again, H quite 
clearly outperforms both of the other two methods. Note that these results are 
for particular parameter choices, and methods of choosing the parameters /?, kc, 
^, and may be sensitive to their choice. 



6. Conclusions 

This manuscript introduces and investigates a new class of proposals for 
function space MCMC, giving rise to a new class of algorithms. These propos- 
als involve an operator which appropriately rescales the proposal step-sizes in 
order to decrease autocorrelation and hence improve convergence rate. For the 
curvature-inspired rescaling H introduced here, there are two design parameters 
which need to be hand-chosen. But, it seems to be effective to choose ( suffi- 
ciently small and then allow f3 to be chosen during a transient adaptive phase, 
as in the other methods. It should be possible to acheive the same result with 
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Figure 3: Truth (top) and expected value of the posterior (bottom). Initial condition is given 
to the left and solution at final time is given to the right. 
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Figure 5: The autocorrelation for each method O, C, and H, for a variety of frequencies. Also 
given for comparison is the exponential fit to the (acceptance lagged) autocorrelation function 
for the proposal chain for O, i.e. exp(— n/3^Ea/2). This differs for each, because /3 and Eq 
differ. 
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Figure 6: The relative error in the mean (left) and variance (right) with respect to the con- 
verged values as a function of sample number for small sample sizes. 
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Figure 7: The PSRF plots for O (left), C (middle) and H (right). The images are saturated 
on the same scale so that H is visible. 

one design parameter. Furthermore, there exists a whole range of more sophis- 
ticated models between those investigated here and the more sophisticated but 
expensive RMHMC Q and Stochastic Newton This gives rise to exciting 
opportunities for future investigation. It will be of great interest to establish 
rigorous theoretical results, such as geometric ergodicity, for these methods. 
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